12 research outputs found

    Abstractive spoken document summarization using hierarchical model with multi-stage attention diversity optimization

    Get PDF
    Abstractive summarization is a standard task for written documents, such as news articles. Applying summarization schemes to spoken documents is more challenging, especially in situations involving human interactions, such as meetings. Here, utterances tend not to form complete sentences and sometimes contain little information. Moreover, speech disfluencies will be present, as will recognition errors for automated systems. For current attention-based sequence-to-sequence summarization systems, these additional challenges can yield a poor attention distribution over the spoken document's words and utterances, impacting performance. In this work, we propose a multi-stage method based on a hierarchical encoder-decoder model to explicitly model the utterance-level attention distribution at training time, and to enforce diversity at inference time using a unigram diversity term. Furthermore, multitask learning tasks, including dialogue act classification and extractive summarization, are incorporated. The performance of the system is evaluated on the AMI meeting corpus. The inclusion of both training and inference diversity terms improves performance, outperforming current state-of-the-art systems in terms of ROUGE scores. Additionally, the impact of ASR errors, as well as performance on the multitask learning tasks, is evaluated.
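
    As a rough sketch of the inference-time idea (the reranking setup, the linear combination, and the weight alpha are illustrative assumptions, not the paper's exact formulation), a unigram diversity term can be used to rescore candidate summaries:

    # Sketch only: rescore beam-search hypotheses with a unigram diversity term.
    from collections import Counter

    def combined_score(tokens, log_prob, alpha=1.0):
        # Fraction of distinct unigrams: 1.0 means no word is repeated.
        distinct_ratio = len(Counter(tokens)) / max(len(tokens), 1)
        # alpha is a hypothetical trade-off weight, not a value from the paper.
        return log_prob + alpha * distinct_ratio

    hyps = [
        ("the meeting discussed the the budget".split(), -4.2),
        ("the group discussed the budget plan".split(), -4.3),
    ]
    best_tokens, _ = max(hyps, key=lambda h: combined_score(*h))
    print(" ".join(best_tokens))  # the more diverse hypothesis wins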

    Impact of ASR performance on spoken grammatical error detection

    Get PDF
    Computer assisted language learning (CALL) systems aid learners in monitoring their progress by providing scoring and feedback on language assessment tasks. Free speaking tests allow assessment of what a learner has said, as well as how they said it. For these tasks, Automatic Speech Recognition (ASR) is required to generate transcriptions of a candidate's responses; the quality of these transcriptions is crucial for providing reliable feedback in downstream processes. This paper considers the impact of ASR performance on Grammatical Error Detection (GED) for free speaking tasks, as an example of providing feedback on a learner's use of English. The performance of an advanced deep-learning based GED system, initially trained on written corpora, is used to evaluate the influence of ASR errors. One consequence of these errors is that grammatical errors can result from incorrect transcriptions as well as learner errors, which may yield confusing feedback. To mitigate the effect of these errors, and reduce erroneous feedback, ASR confidence scores are incorporated into the GED system. By additionally adapting the written-text GED system to the speech domain, using ASR transcriptions, significant gains in performance can be achieved. Analysis of GED performance for different grammatical error types and across grades is also presented.
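
    A minimal sketch of one way per-word ASR confidence scores could be fed into a token-level error detector (the architecture, dimensions, and class name are assumptions for illustration, not the authors' system):

    # Sketch only: concatenate each word embedding with its ASR confidence.
    import torch
    import torch.nn as nn

    class ConfidenceAwareGED(nn.Module):  # hypothetical name
        def __init__(self, vocab_size, embed_dim=128, hidden_dim=128):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, embed_dim)
            # +1 input feature for the scalar confidence attached to each token
            self.rnn = nn.LSTM(embed_dim + 1, hidden_dim,
                               batch_first=True, bidirectional=True)
            self.classifier = nn.Linear(2 * hidden_dim, 2)  # correct vs. error

        def forward(self, token_ids, confidences):
            # token_ids: (batch, seq); confidences: (batch, seq), values in [0, 1]
            x = self.embed(token_ids)
            x = torch.cat([x, confidences.unsqueeze(-1)], dim=-1)
            h, _ = self.rnn(x)
            return self.classifier(h)  # per-token error logits

    The intuition is that a detector with access to the confidence channel can learn to discount words the recogniser itself was unsure about, reducing feedback triggered by misrecognitions rather than learner errors.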

    Long-span summarization via local attention and content selection

    No full text
    Transformer-based models have achieved state-of-the-art results in a wide range of natural language processing (NLP) tasks, including document summarization. Typically these systems are trained by fine-tuning a large pre-trained model on the target task. One issue with these transformer-based models is that they do not scale well in terms of memory and compute requirements as the input length grows. Thus, for long document summarization, it can be challenging to train or fine-tune these models. In this work, we exploit large pre-trained transformer-based models and address long-span dependencies in abstractive summarization using two methods: local self-attention and explicit content selection. These approaches are compared on a range of network configurations. Experiments are carried out on standard long-span summarization tasks, including the Spotify Podcast, arXiv, and PubMed datasets. We demonstrate that by combining these methods we can achieve state-of-the-art results on all three tasks in terms of ROUGE scores. Moreover, without a large-scale GPU card, our approach can achieve comparable or better results than existing approaches.
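
    A small sketch of the local self-attention idea (the banded-mask construction and the window size are illustrative; this is not the paper's implementation):

    # Sketch only: allow each position to attend to a fixed local window,
    # so attention cost grows linearly with sequence length rather than
    # quadratically.
    import torch

    def local_attention_mask(seq_len, window=2):
        idx = torch.arange(seq_len)
        # True where attention is permitted: |i - j| <= window
        return (idx[None, :] - idx[:, None]).abs() <= window

    mask = local_attention_mask(seq_len=8, window=2)
    print(mask.int())  # banded matrix: each token sees 2 neighbours per side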

    Disfluency Detection for Spoken Learner English

    No full text
    One of the challenges for computer aided language learning (CALL) is providing high quality feedback to learners. An obstacle to improving feedback is the lack of labelled training data for tasks such as spoken "grammatical" error detection and correction, both of which provide important features that can be used in downstream feedback systems. One approach to addressing this lack of data is to convert the output of an automatic speech recognition (ASR) system into a form that is closer to text data, for which there is significantly more labelled data available. Disfluency detection, which locates regions of the speech where, for example, false starts and repetitions occur, followed by removal of the associated words, helps to make speech transcriptions more text-like. Additionally, ASR systems do not usually generate sentence-like units; the output is simply a sequence of words associated with the particular speech segmentation used for decoding. This motivates the need for automated systems for sentence segmentation. By combining these approaches, advanced text processing techniques should perform significantly better on the output from spoken language processing systems. Unfortunately, there is not enough labelled data available to train these systems on spoken learner English. In this work, disfluency detection and "sentence" segmentation systems trained on data from native speakers are applied to spoken grammatical error detection and correction tasks for learners of English. Performance gains using these approaches are shown on a free speaking test.
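
    A toy sketch of the pipeline idea (the tag scheme, boundary format, and example are invented for illustration, not taken from the paper):

    # Sketch only: strip words tagged as disfluent, then split the stream
    # into sentence-like units before applying text-based GED tools.
    def remove_disfluencies(words, tags):
        # 'D' marks a disfluent word, 'O' a fluent one (hypothetical scheme)
        return [w for w, t in zip(words, tags) if t == "O"]

    def segment(words, boundaries):
        # boundaries: indices where a new sentence-like unit starts
        cuts = [0] + list(boundaries) + [len(words)]
        return [words[a:b] for a, b in zip(cuts, cuts[1:]) if words[a:b]]

    words = "i i want um want to to go home".split()
    tags  = ["D", "O", "D", "D", "O", "D", "O", "O", "O"]
    print(remove_disfluencies(words, tags))  # ['i', 'want', 'to', 'go', 'home']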
